Bug #36709
OSD stuck while flushing rocksdb WAL
Description
Hi all,
We use:
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
Clients work on:
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
We see that sometimes random OSDs with rocksdb and filestore get stuck, with these lines in the log:
2018-11-06 12:03:12.583520 7fd28b81b700 0 log_channel(cluster) log [DBG] : 14.95c scrub starts
2018-11-06 12:03:15.855132 7fd28f823700 0 log_channel(cluster) log [DBG] : 14.95c scrub ok
2018-11-06 12:03:17.584285 7fd290024700 0 log_channel(cluster) log [DBG] : 14.292b scrub starts
2018-11-06 12:03:21.264719 7fd290024700 0 log_channel(cluster) log [DBG] : 14.292b scrub ok
2018-11-06 12:03:23.585003 7fd28b01a700 0 log_channel(cluster) log [DBG] : 14.2e05 scrub starts
2018-11-06 12:03:28.201684 7fd28f022700 0 log_channel(cluster) log [DBG] : 14.2e05 scrub ok
2018-11-06 12:03:54.286060 7fd2a484d700 4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/db_impl_write.cc:725] [default] New memtable created with log file: #79900. Immutable memtables: 0.
2018-11-06 12:03:54.286115 7fd287813700 4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:49] [JOB 950] Syncing log #79898
2018-11-06 12:03:54.286518 7fd287813700 4 rocksdb: (Original Log Time 2018/11/06-12:03:54.286103) [/build/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:1158] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2018-11-06 12:03:54.286527 7fd287813700 4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/flush_job.cc:264] [default] [JOB 950] Flushing memtable with next log file: 79900
2018-11-06 12:03:54.286541 7fd287813700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1541495034286533, "job": 950, "event": "flush_started", "num_memtables": 1, "num_entries": 370287, "num_deletes": 146919, "memory_usage": 65025896}
2018-11-06 12:03:54.286547 7fd287813700 4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/flush_job.cc:293] [default] [JOB 950] Level-0 flush table #79901: started
2018-11-06 12:03:54.452365 7fd287813700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1541495034452352, "cf_name": "default", "job": 950, "event": "table_file_creation", "file_number": 79901, "file_size": 14836047, "table_properties": {"data_size": 13956943, "index_size": 281340, "filter_size": 596833, "raw_key_size": 15773747, "raw_average_key_size": 69, "raw_value_size": 10003671, "raw_average_value_size": 44, "num_data_blocks": 3525, "num_entries": 225940, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "146919", "kMergeOperands": "0"}}
2018-11-06 12:03:54.452385 7fd287813700 4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/flush_job.cc:319] [default] [JOB 950] Level-0 flush table #79901: 14836047 bytes OK
2018-11-06 12:03:54.452722 7fd287813700 4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452394) [/build/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:360] [default] Level-0 commit table #79901 started
2018-11-06 12:03:54.452728 7fd287813700 4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452668) [/build/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:383] [default] Level-0 commit table #79901: memtable #1 done
2018-11-06 12:03:54.452732 7fd287813700 4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452678) EVENT_LOG_v1 {"time_micros": 1541495034452674, "job": 950, "event": "flush_finished", "lsm_state": [2, 2, 5, 0, 0, 0, 0], "immutable_memtables": 0}
2018-11-06 12:03:54.452735 7fd287813700 4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452706) [/build/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:132] [default] Level summary: base level 1 max bytes base 268435456 files[2 2 5 0 0 0 0] max score 0.93
2018-11-06 12:03:54.452742 7fd287813700 4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/db_impl_files.cc:388] [JOB 950] Try to delete WAL files size 56883285, prev total WAL file size 56916690, number of live WAL files 2.
There are no "blocked operations" messages in ceph status, but some write requests get stuck for a really long time (seconds if not minutes).
The only way to find such OSDs while the problem is happening is to mark some OSDs from the pool down. This triggers peering, and the problematic OSDs are then found because they get stuck in the peering state.
Last time we saw that two PGs on one OSD had been stuck in snaptrim:
~# ceph pg dump | grep snaptrim
dumped all
14.99f 115 0 0 0 0 1399443474 1585 1585 active+clean+snaptrim 2018-11-06 10:45:46.177039 1318693'8143685 1318693:11024026 [225,527,628] 225 [225,527,628] 225 1318210'8139261 2018-11-05 23:11:26.827327 0'0 2016-10-14 07:59:40.557778 20
14.129b 125 0 0 0 0 1768898560 1562 1562 active+clean+snaptrim 2018-11-06 10:45:12.233255 1318693'18374010 1318693:16891803 [527,225,650] 527 [527,225,650] 527 1318190'18342505 2018-11-05 06:48:51.767389 0'0 2016-10-14 07:59:40.567406 26
After we marked it down (ceph osd down osd.225), everything went back to normal.
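For reference, the check above can be scripted. This is only a rough sketch, not an official tool: the list_snaptrim_pgs helper is hypothetical, and the awk column positions are an assumption based on the luminous-era pg dump layout, which may differ between releases.

```shell
# Hypothetical helper: filter `ceph pg dump` output down to PGs whose state
# contains "snaptrim", printing the PG id and state columns. The column
# positions ($1 = pgid, $10 = state) are assumed from the dump shown above.
list_snaptrim_pgs() {
    awk '$0 ~ /snaptrim/ {print $1, $10}'
}

# On a live cluster this would be:
#   ceph pg dump 2>/dev/null | list_snaptrim_pgs
# Here we demonstrate on a sample line shaped like the dump above (trimmed):
sample='14.99f 115 0 0 0 0 1399443474 1585 1585 active+clean+snaptrim 2018-11-06'
result=$(printf '%s\n' "$sample" | list_snaptrim_pgs)
echo "$result"
```

The stuck PG's primary (the first entry of the acting set) is then the OSD to kick with ceph osd down.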
Hope for any help.
If any additional info is needed, please ask.
History
#1 Updated by Greg Farnum over 5 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
#2 Updated by Sage Weil over 5 years ago
- Status changed from New to Need More Info
I'm not sure rocksdb is what's stuck.. can you dump 'ceph daemon osd.NNN ops' to see what state the operations are in? Or even attach gdb to the process and 'thr app all bt' to dump all threads?
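For anyone following along, the two dumps can be collected roughly like this (osd.52, the output file names, and the single-OSD-per-host assumption are just examples, not part of any official procedure):

```shell
# Dump the in-flight operations of a given OSD via its admin socket
# (run on the host where that OSD daemon lives):
ceph daemon osd.52 ops > osd.52.ops.json

# Attach gdb to the running ceph-osd process and dump every thread's
# backtrace ("thr app all bt" is the abbreviated form of this command):
pid=$(pidof -s ceph-osd)   # assumes one ceph-osd here; otherwise pick the PID by hand
gdb -p "$pid" -batch -ex 'thread apply all bt' > osd.52.backtrace
```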
#3 Updated by Aleksei Zakharov over 5 years ago
- File osd_pid.20503.backtrace added
I've finally reproduced this behavior (I hope).
Our staging cluster:
3 nodes with 22 OSDs on SSDs,
Kernel 4.15.0-32-generic #35~16.04.1-Ubuntu
The cluster has old-client support:
require_min_compat_client hammer
min_compat_client hammer
require_osd_release luminous
and crush tunables are set to firefly.
~# ceph osd crush show-tunables
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 0,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 22,
    "profile": "firefly",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "minimum_required_version": "hammer",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 0,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 0,
    "has_v5_rules": 0
}
ceph version (clients, osds, and monitors):
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
I run fio from 22 virtual machines with this command line:
fio --name=test --ioengine=libaio --iodepth=1 --direct=1 --filename=/root/fiofile --filesize=5G --runtime=120 --blocksize=4k --rw=randwrite
When it works normally, it shows 4K IOPS from every virtual machine, which is the qemu limit we've set.
About half the time it gets stuck. "sd 2:0:0:0: [sda] tag#1 abort" messages can be found in the VM's dmesg. iostat -xtd 1 shows:
12/12/2018 12:47:23 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.06 0.00 0.00 0.00 0.00 101.01
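Since ceph status stays clean here, the guest-side dmesg is the more reliable signal. A small sketch for spotting those aborts; the exact message format varies by kernel version, so the pattern below is an assumption checked only against the line quoted in this report:

```shell
# Sketch: count SCSI abort messages like the one quoted above. On a live
# guest this would be fed from `dmesg`; here the pattern is checked against
# a sample line copied from this report.
abort_pattern='sd [0-9:]+: \[sd[a-z]+\] tag#[0-9]+ abort'
sample='sd 2:0:0:0: [sda] tag#1 abort'
matches=$(printf '%s\n' "$sample" | grep -cE "$abort_pattern")
echo "$matches"
```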
When this happens, I mark all OSDs on one host down. For some reason, the OSDs get stuck only on another host.
After I mark all OSDs on the first host down, I wait some time and check ceph health detail:
~# ceph health detail
HEALTH_WARN noout,nodeep-scrub flag(s) set; Reduced data availability: 35 pgs inactive, 35 pgs peering; 40482 slow requests are blocked > 32 sec. Implicated osds 32,34,35,38,39,40,41,42,43,44,46,48,52,74,76,77,78,79,80,81,89,92
OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
PG_AVAILABILITY Reduced data availability: 35 pgs inactive, 35 pgs peering
    pg 50.15 is stuck peering for 252.432610, current state peering, last acting [89,33,52]
    pg 50.3c is stuck peering for 256.026446, current state peering, last acting [81,32,52]
    pg 50.4d is stuck peering for 273.949284, current state peering, last acting [74,52,30]
    pg 50.92 is stuck peering for 243.807450, current state peering, last acting [78,36,52]
    pg 50.bb is stuck peering for 254.226990, current state peering, last acting [32,96,52]
    pg 50.c1 is stuck peering for 224.870375, current state peering, last acting [41,52,93]
    pg 50.d3 is stuck peering for 250.713368, current state peering, last acting [79,34,52]
    pg 50.f0 is stuck peering for 226.680818, current state peering, last acting [41,52,80]
    pg 50.106 is stuck peering for 253.985154, current state peering, last acting [92,52,33]
    pg 50.130 is stuck peering for 203.517863, current state peering, last acting [48,52,75]
    pg 50.137 is stuck peering for 206.796044, current state peering, last acting [46,52,92]
    pg 50.149 is stuck peering for 226.680859, current state peering, last acting [41,52,74]
    pg 50.161 is stuck peering for 224.242458, current state peering, last acting [42,52,90]
    pg 50.169 is stuck peering for 235.408025, current state peering, last acting [38,79,52]
    pg 50.183 is stuck peering for 234.849956, current state peering, last acting [38,52,76]
    pg 50.1ee is stuck peering for 261.064377, current state peering, last acting [80,31,52]
    pg 50.20a is stuck peering for 250.713153, current state peering, last acting [78,34,52]
    pg 50.225 is stuck peering for 248.652645, current state peering, last acting [34,91,52]
    pg 50.246 is stuck peering for 217.827022, current state peering, last acting [44,95,52]
    pg 50.28b is stuck peering for 219.954617, current state peering, last acting [43,91,52]
    pg 50.294 is stuck peering for 232.881190, current state peering, last acting [39,52,78]
    pg 50.2ae is stuck peering for 230.378480, current state peering, last acting [40,75,52]
    pg 50.2cd is stuck peering for 243.839058, current state peering, last acting [35,52,74]
    pg 50.2d1 is stuck peering for 248.652355, current state peering, last acting [34,75,52]
    pg 50.2f3 is stuck peering for 226.680902, current state peering, last acting [41,83,52]
    pg 50.302 is stuck peering for 243.838994, current state peering, last acting [35,52,94]
    pg 50.309 is stuck peering for 226.681108, current state peering, last acting [41,84,52]
    pg 50.311 is stuck peering for 256.257842, current state peering, last acting [76,52,32]
    pg 50.313 is stuck peering for 203.517920, current state peering, last acting [48,52,80]
    pg 50.366 is stuck peering for 224.242326, current state peering, last acting [42,86,52]
    pg 50.377 is stuck peering for 248.652813, current state peering, last acting [34,52,94]
    pg 50.39d is stuck peering for 261.048597, current state peering, last acting [77,31,52]
    pg 50.3b4 is stuck peering for 254.089555, current state peering, last acting [80,33,52]
    pg 50.3cc is stuck peering for 203.518124, current state peering, last acting [48,95,52]
    pg 50.3e4 is stuck peering for 243.839263, current state peering, last acting [35,83,52]
REQUEST_SLOW 40482 slow requests are blocked > 32 sec. Implicated osds 32,34,35,38,39,40,41,42,43,44,46,48,52,74,76,77,78,79,80,81,89,92
    4265 ops are blocked > 524.288 sec
    32214 ops are blocked > 262.144 sec
    3922 ops are blocked > 131.072 sec
    54 ops are blocked > 65.536 sec
    27 ops are blocked > 32.768 sec
    osds 32,34,35,38,39,40,41,42,43,44,46,48,52 have blocked requests > 262.144 sec
    osds 74,76,77,78,79,80,81,89,92 have blocked requests > 524.288 sec
It's easy to see that the PGs blocked in peering all have osd.52 in their acting set.
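That observation can be automated: pull the acting sets out of the health output and count how often each OSD id appears. A sketch under the assumption that the "last acting [a,b,c]" field is always the final column of each stuck-peering line (the count_acting_osds helper is hypothetical):

```shell
# Sketch: count OSD ids across the acting sets of stuck-peering PGs.
# Assumes `ceph health detail` lines end with "last acting [a,b,c]".
count_acting_osds() {
    awk '/stuck peering/ {print $NF}' | tr -d '[]' | tr ',' '\n' | sort | uniq -c | sort -rn
}

# On a live cluster: ceph health detail | count_acting_osds
# Demonstrated on two sample lines mirroring the output above:
sample='pg 50.15 is stuck peering for 252.432610, current state peering, last acting [89,33,52]
pg 50.3c is stuck peering for 256.026446, current state peering, last acting [81,32,52]'
top=$(printf '%s\n' "$sample" | count_acting_osds | head -n1)
echo "$top"   # the most frequently implicated OSD id appears first
```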
If I mark this OSD (osd.52 in this example) down or restart the service, the problem disappears and the pool operates normally.
If I wait, the OSD may eventually start working normally again, but sometimes it dies with:
1. A lot of "no reply" messages in the logs:
...
-153> 2018-12-12 14:43:39.233249 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6844 osd.31 since back 2018-12-12 14:41:30.170055 front 2018-12-12 14:41:30.170055 (cutoff 2018-12-12 14:43:19.233213)
-152> 2018-12-12 14:43:39.233258 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6803 osd.33 since back 2018-12-12 14:41:41.685147 front 2018-12-12 14:41:41.685147 (cutoff 2018-12-12 14:43:19.233213)
-151> 2018-12-12 14:43:39.233284 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6805 osd.34 since back 2018-12-12 14:41:41.685147 front 2018-12-12 14:41:41.685147 (cutoff 2018-12-12 14:43:19.233213)
-150> 2018-12-12 14:43:39.233291 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6807 osd.35 since back 2018-12-12 14:41:46.388834 front 2018-12-12 14:41:46.388834 (cutoff 2018-12-12 14:43:19.233213)
-149> 2018-12-12 14:43:39.233301 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6809 osd.36 since back 2018-12-12 14:41:50.396253 front 2018-12-12 14:41:50.396253 (cutoff 2018-12-12 14:43:19.233213)
-148> 2018-12-12 14:43:39.233309 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6815 osd.37 since back 2018-12-12 14:41:50.396253 front 2018-12-12 14:41:50.396253 (cutoff 2018-12-12 14:43:19.233213)
-147> 2018-12-12 14:43:39.233318 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6811 osd.38 since back 2018-12-12 14:41:53.900087 front 2018-12-12 14:41:53.900087 (cutoff 2018-12-12 14:43:19.233213)
...
2. And a call trace after that:
0> 2018-12-11 19:32:09.119042 7f5b31b04700 -1 *** Caught signal (Aborted) **
in thread 7f5b31b04700 thread_name:tp_osd_tp
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (()+0xa87324) [0x55b4cca24324]
2: (()+0x11390) [0x7f5b544a4390]
3: (pthread_cond_wait()+0xc0) [0x7f5b544a0360]
4: (Throttle::_wait(long)+0x34d) [0x55b4cca5c97d]
5: (Throttle::get(long, long)+0x2e5) [0x55b4cca5d7a5]
6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0xb55) [0x55b4cc91e6c5]
7: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x55b4cc63fb26]
8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcb8) [0x55b4cc776278]
9: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x9fa) [0x55b4cc5dbc2a]
10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x139a) [0x55b4cc6244ca]
11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2ff4) [0x55b4cc627e24]
12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xea9) [0x55b4cc5e1f69]
13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55b4cc4555f9]
14: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55b4cc703327]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1047) [0x55b4cc483a17]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55b4cca6c204]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b4cca6f240]
18: (()+0x76ba) [0x7f5b5449a6ba]
19: (clone()+0x6d) [0x7f5b5351141d]
There are no error messages in dmesg on the host, and there's no heavy load. It looks like the host is doing nothing at that moment.
I've attached gdb backtrace.
Operations dump is too big (1.6M tar), so it is available here - https://166924.selcdn.ru/links/osd.52.operations.dump.tar.gz
In our production cluster where we see the problem often, on client side we use:
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
And osds version is:
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
~# ceph osd crush show-tunables
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 0,
    "straw_calc_version": 2,
    "allowed_bucket_algs": 54,
    "profile": "hammer",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "minimum_required_version": "hammer",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 0,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 0,
    "has_v5_rules": 0
}
We can't see any blocked ops when the problem appears in production. Latency metrics also look OK, but some virtual machines get stuck on IO operations in the same way as described before. We've never seen that call trace in production; probably it depends on the load.
If any additional info is needed, please ask.
#4 Updated by Josh Durgin over 5 years ago
The backtrace that's attached shows the kv_sync_thread waiting for I/O to complete from the block device:
Thread 17 (Thread 0x7f3826c25700 (LWP 20542)):
#0 0x00007f383759d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f3836ea391c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x0000560710a8903d in IOContext::aio_wait() ()
#3 0x0000560710a8ec4d in BlueFS::wait_for_aio(BlueFS::FileWriter*) ()
#4 0x0000560710a91bc3 in BlueFS::_flush_bdev_safely(BlueFS::FileWriter*) ()
#5 0x0000560710aa5b9d in BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&) ()
#6 0x0000560710ab6e53 in BlueRocksWritableFile::Sync() ()
#7 0x0000560710efb0bc in rocksdb::WritableFileWriter::SyncInternal(bool) ()
#8 0x0000560710efdc92 in rocksdb::WritableFileWriter::Sync(bool) ()
#9 0x0000560710f54a28 in rocksdb::DBImpl::WriteToWAL(rocksdb::autovector<rocksdb::WriteThread::Writer*, 8ul> const&, rocksdb::log::Writer*, bool, bool, unsigned long) ()
#10 0x0000560710f55f57 in rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool) ()
#11 0x0000560710f56ada in rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) ()
#12 0x0000560710a1d2cf in RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>) ()
#13 0x00005607109ba0ff in BlueStore::_kv_sync_thread() ()
#14 0x00005607109f9b2d in BlueStore::KVSyncThread::entry() ()
#15 0x00007f38375976ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f383660e41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Other threads are waiting on throttles, presumably because the queues are full. This looks like the OSD disk is the bottleneck. What do iostat and CPU usage look like on the OSD when this occurs? Is this pure HDD, or is there an SSD wal or db?
#5 Updated by Aleksei Zakharov over 5 years ago
iostat -xtd 1 output most of the time while the problem occurs:
12/13/2018 11:43:20 AM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdn 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdr 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sds 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdu 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdv 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdw 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdx 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
It is similar on all 3 hosts.
CPU usage is about 0-1%:
~# pidstat 1
Linux 4.15.0-32-generic (ms2002) 12/13/2018 _x86_64_ (56 CPU)

11:56:21 AM UID PID %usr %system %guest %CPU CPU Command
11:56:22 AM 0 5247 0.98 0.00 0.00 0.98 13 ceph-osd
11:56:22 AM 0 5592 0.00 0.98 0.00 0.98 0 ceph-osd
11:56:22 AM 0 5847 0.98 0.00 0.00 0.98 29 ceph-osd
11:56:22 AM 0 6408 0.00 0.98 0.00 0.98 0 ceph-osd
11:56:22 AM 0 6854 0.98 0.00 0.00 0.98 14 ceph-osd
11:56:22 AM 0 7228 0.00 0.98 0.00 0.98 15 ceph-osd
11:56:22 AM 0 7534 0.98 0.00 0.00 0.98 34 ceph-osd
11:56:22 AM 0 7709 0.00 0.98 0.00 0.98 53 ceph-osd
11:56:22 AM 0 7854 0.00 0.98 0.00 0.98 14 ceph-osd
11:56:22 AM 0 23414 0.00 0.98 0.00 0.98 6 ceph-osd
11:56:22 AM 0 31130 0.00 0.98 0.00 0.98 21 pidstat

11:56:22 AM UID PID %usr %system %guest %CPU CPU Command
11:56:23 AM 0 5592 1.00 0.00 0.00 1.00 0 ceph-osd
11:56:23 AM 0 6012 1.00 0.00 0.00 1.00 29 ceph-osd
11:56:23 AM 0 6854 0.00 1.00 0.00 1.00 14 ceph-osd
11:56:23 AM 0 7113 0.00 1.00 0.00 1.00 27 ceph-osd
11:56:23 AM 0 7418 0.00 1.00 0.00 1.00 0 ceph-osd
11:56:23 AM 0 16101 1.00 0.00 0.00 1.00 14 ceph-osd
11:56:23 AM 0 17231 1.00 0.00 0.00 1.00 0 ceph-osd
11:56:23 AM 0 19993 1.00 0.00 0.00 1.00 27 ceph-osd
11:56:23 AM 0 31130 1.00 1.00 0.00 2.00 21 pidstat

11:56:23 AM UID PID %usr %system %guest %CPU CPU Command
11:56:24 AM 0 5129 1.00 0.00 0.00 1.00 35 ceph-osd
11:56:24 AM 0 5847 0.00 1.00 0.00 1.00 29 ceph-osd
11:56:24 AM 0 6408 0.00 1.00 0.00 1.00 0 ceph-osd
11:56:24 AM 0 6993 1.00 0.00 0.00 1.00 18 ceph-osd
11:56:24 AM 0 7113 1.00 0.00 0.00 1.00 27 ceph-osd
11:56:24 AM 0 31130 1.00 1.00 0.00 2.00 21 pidstat
We use SSD disks; the OSDs are configured like this:
[osd.52]
host = hostname
devs = /dev/disk/by-id/scsi-330000d17022f8d03-part1
bluestore block path = /dev/disk/by-id/scsi-330000d17022f8d03-part2
So, you mean that this could be a kernel bug in the IO layer?
#6 Updated by Aleksei Zakharov over 5 years ago
Thanks for your answers, that was helpful info.
It looks like an issue with the aacraid module v1.2.1.50877. IO requests got stuck under load, but iostat was not able to show the issue. After upgrading to 1.2.1.56008 I can't reproduce this behavior anymore.
#7 Updated by Igor Fedotov over 5 years ago
- Status changed from Need More Info to Closed