I've finally reproduced this behavior (I hope).
Our staging cluster:
3 nodes with 22 OSDs on SSDs,
Kernel 4.15.0-32-generic #35~16.04.1-Ubuntu
The cluster has old-client support enabled:
require_min_compat_client hammer
min_compat_client hammer
require_osd_release luminous
and CRUSH tunables are set to firefly.
~# ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 0,
"straw_calc_version": 1,
"allowed_bucket_algs": 22,
"profile": "firefly",
"optimal_tunables": 0,
"legacy_tunables": 0,
"minimum_required_version": "hammer",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 1,
"require_feature_tunables5": 0,
"has_v5_rules": 0
}
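(For reference, a compat/tunables state like this is normally configured with the standard commands below; I'm not claiming these were the exact commands run on this cluster:)
~# ceph osd set-require-min-compat-client hammer
~# ceph osd crush tunables firefly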
Ceph version (clients, OSDs, and monitors):
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
I run fio from 22 virtual machines with this command line:
fio --name=test --ioengine=libaio --iodepth=1 --direct=1 --filename=/root/fiofile --filesize=5G --runtime=120 --blocksize=4k --rw=randwrite
When it works normally, fio shows 4K IOPS from every virtual machine, which is the QEMU limit we've set.
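(The 4K IOPS cap is enforced on the hypervisor side; for illustration, a per-disk limit like that can be applied with something like the following, where the domain and device names are only examples:)
~# virsh blkdeviotune <domain> vda --total-iops-sec 4000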
Roughly half the time it gets stuck. "sd 2:0:0:0: [sda] tag#1 abort" messages can be found in the VM's dmesg, and iostat -xtd 1 shows:
12/12/2018 12:47:23 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.06 0.00 0.00 0.00 0.00 101.01
When this happens, I mark all OSDs on one host down; for some reason, the OSDs that get stuck are always on another host. After marking all OSDs on the first host down, I wait a while and check ceph health detail:
~# ceph health detail
HEALTH_WARN noout,nodeep-scrub flag(s) set; Reduced data availability: 35 pgs inactive, 35 pgs peering; 40482 slow requests are blocked > 32 sec. Implicated osds 32,34,35,38,39,40,41,42,43,44,46,48,52,74,76,77,78,79,80,81,89,92
OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
PG_AVAILABILITY Reduced data availability: 35 pgs inactive, 35 pgs peering
pg 50.15 is stuck peering for 252.432610, current state peering, last acting [89,33,52]
pg 50.3c is stuck peering for 256.026446, current state peering, last acting [81,32,52]
pg 50.4d is stuck peering for 273.949284, current state peering, last acting [74,52,30]
pg 50.92 is stuck peering for 243.807450, current state peering, last acting [78,36,52]
pg 50.bb is stuck peering for 254.226990, current state peering, last acting [32,96,52]
pg 50.c1 is stuck peering for 224.870375, current state peering, last acting [41,52,93]
pg 50.d3 is stuck peering for 250.713368, current state peering, last acting [79,34,52]
pg 50.f0 is stuck peering for 226.680818, current state peering, last acting [41,52,80]
pg 50.106 is stuck peering for 253.985154, current state peering, last acting [92,52,33]
pg 50.130 is stuck peering for 203.517863, current state peering, last acting [48,52,75]
pg 50.137 is stuck peering for 206.796044, current state peering, last acting [46,52,92]
pg 50.149 is stuck peering for 226.680859, current state peering, last acting [41,52,74]
pg 50.161 is stuck peering for 224.242458, current state peering, last acting [42,52,90]
pg 50.169 is stuck peering for 235.408025, current state peering, last acting [38,79,52]
pg 50.183 is stuck peering for 234.849956, current state peering, last acting [38,52,76]
pg 50.1ee is stuck peering for 261.064377, current state peering, last acting [80,31,52]
pg 50.20a is stuck peering for 250.713153, current state peering, last acting [78,34,52]
pg 50.225 is stuck peering for 248.652645, current state peering, last acting [34,91,52]
pg 50.246 is stuck peering for 217.827022, current state peering, last acting [44,95,52]
pg 50.28b is stuck peering for 219.954617, current state peering, last acting [43,91,52]
pg 50.294 is stuck peering for 232.881190, current state peering, last acting [39,52,78]
pg 50.2ae is stuck peering for 230.378480, current state peering, last acting [40,75,52]
pg 50.2cd is stuck peering for 243.839058, current state peering, last acting [35,52,74]
pg 50.2d1 is stuck peering for 248.652355, current state peering, last acting [34,75,52]
pg 50.2f3 is stuck peering for 226.680902, current state peering, last acting [41,83,52]
pg 50.302 is stuck peering for 243.838994, current state peering, last acting [35,52,94]
pg 50.309 is stuck peering for 226.681108, current state peering, last acting [41,84,52]
pg 50.311 is stuck peering for 256.257842, current state peering, last acting [76,52,32]
pg 50.313 is stuck peering for 203.517920, current state peering, last acting [48,52,80]
pg 50.366 is stuck peering for 224.242326, current state peering, last acting [42,86,52]
pg 50.377 is stuck peering for 248.652813, current state peering, last acting [34,52,94]
pg 50.39d is stuck peering for 261.048597, current state peering, last acting [77,31,52]
pg 50.3b4 is stuck peering for 254.089555, current state peering, last acting [80,33,52]
pg 50.3cc is stuck peering for 203.518124, current state peering, last acting [48,95,52]
pg 50.3e4 is stuck peering for 243.839263, current state peering, last acting [35,83,52]
REQUEST_SLOW 40482 slow requests are blocked > 32 sec. Implicated osds 32,34,35,38,39,40,41,42,43,44,46,48,52,74,76,77,78,79,80,81,89,92
4265 ops are blocked > 524.288 sec
32214 ops are blocked > 262.144 sec
3922 ops are blocked > 131.072 sec
54 ops are blocked > 65.536 sec
27 ops are blocked > 32.768 sec
osds 32,34,35,38,39,40,41,42,43,44,46,48,52 have blocked requests > 262.144 sec
osds 74,76,77,78,79,80,81,89,92 have blocked requests > 524.288 sec
It's easy to see that the PGs blocked in peering all have osd.52 in their acting set.
If I mark this OSD (osd.52 in this example) down or restart its service, the problem disappears and the pool operates normally.
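(For completeness, finding and kicking the culprit is a couple of commands; osd.52 is just this run's example, and the restart must be run on the host carrying that OSD:)
~# ceph health detail | grep 'stuck peering'   # osd.52 shows up in every acting set
~# ceph osd down 52
~# systemctl restart ceph-osd@52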
If I just wait, the OSD may start working normally again, but sometimes the OSD dies with:
1. A lot of "no reply" messages in the logs:
...
-153> 2018-12-12 14:43:39.233249 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6844 osd.31 since back 2018-12-12 14:41:30.170055 front 2018-12-12 14:41:30.170055 (cutoff 2018-12-12 14:43:19.233213)
-152> 2018-12-12 14:43:39.233258 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6803 osd.33 since back 2018-12-12 14:41:41.685147 front 2018-12-12 14:41:41.685147 (cutoff 2018-12-12 14:43:19.233213)
-151> 2018-12-12 14:43:39.233284 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6805 osd.34 since back 2018-12-12 14:41:41.685147 front 2018-12-12 14:41:41.685147 (cutoff 2018-12-12 14:43:19.233213)
-150> 2018-12-12 14:43:39.233291 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6807 osd.35 since back 2018-12-12 14:41:46.388834 front 2018-12-12 14:41:46.388834 (cutoff 2018-12-12 14:43:19.233213)
-149> 2018-12-12 14:43:39.233301 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6809 osd.36 since back 2018-12-12 14:41:50.396253 front 2018-12-12 14:41:50.396253 (cutoff 2018-12-12 14:43:19.233213)
-148> 2018-12-12 14:43:39.233309 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6815 osd.37 since back 2018-12-12 14:41:50.396253 front 2018-12-12 14:41:50.396253 (cutoff 2018-12-12 14:43:19.233213)
-147> 2018-12-12 14:43:39.233318 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6811 osd.38 since back 2018-12-12 14:41:53.900087 front 2018-12-12 14:41:53.900087 (cutoff 2018-12-12 14:43:19.233213)
...
2. And a call trace after that:
0> 2018-12-11 19:32:09.119042 7f5b31b04700 -1 *** Caught signal (Aborted) **
in thread 7f5b31b04700 thread_name:tp_osd_tp
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (()+0xa87324) [0x55b4cca24324]
2: (()+0x11390) [0x7f5b544a4390]
3: (pthread_cond_wait()+0xc0) [0x7f5b544a0360]
4: (Throttle::_wait(long)+0x34d) [0x55b4cca5c97d]
5: (Throttle::get(long, long)+0x2e5) [0x55b4cca5d7a5]
6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0xb55) [0x55b4cc91e6c5]
7: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x55b4cc63fb26]
8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcb8) [0x55b4cc776278]
9: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x9fa) [0x55b4cc5dbc2a]
10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x139a) [0x55b4cc6244ca]
11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2ff4) [0x55b4cc627e24]
12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xea9) [0x55b4cc5e1f69]
13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55b4cc4555f9]
14: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55b4cc703327]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1047) [0x55b4cc483a17]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55b4cca6c204]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b4cca6f240]
18: (()+0x76ba) [0x7f5b5449a6ba]
19: (clone()+0x6d) [0x7f5b5351141d]
There are no error messages in dmesg on the host and there is no heavy load; it looks like the host is doing nothing at that moment.
I've attached a gdb backtrace.
The operations dump is too big to attach (1.6M tar), so it is available here: https://166924.selcdn.ru/links/osd.52.operations.dump.tar.gz
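(Dumps like these can typically be collected as follows; the PID placeholder and output file names are just examples:)
~# gdb -p <ceph-osd.52 pid> -batch -ex 'thread apply all bt' > osd.52.gdb.txt
~# ceph daemon osd.52 dump_ops_in_flight > osd.52.ops_in_flight.json
~# ceph daemon osd.52 dump_historic_ops > osd.52.historic_ops.json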
In our production cluster, where we see the problem often, the clients run:
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
And the OSD version is:
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
~# ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 0,
"straw_calc_version": 2,
"allowed_bucket_algs": 54,
"profile": "hammer",
"optimal_tunables": 0,
"legacy_tunables": 0,
"minimum_required_version": "hammer",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 1,
"require_feature_tunables5": 0,
"has_v5_rules": 0
}
We can't see any blocked ops when the problem appears in production. Latency metrics are OK too, but some virtual machines get stuck on I/O operations in the same way as described above. We've never seen that call trace in production; it probably depends on the load.
If any additional info is needed, please ask.