Bug #43382
medium io/system load causes quorum failure
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We just found out that if you put some io pressure on your system by e.g. big rsync, the mon process has issues probably due to load of compaction.
2019-12-19 08:21:21.902 7feddd989700 4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting 2019-12-19 08:21:21.902 7feddd989700 4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting 2019-12-19 08:22:49.555 7fede2192700 1 mon.m3-1045556@2(peon).osd e39485 e39485: 136 total, 136 up, 136 in 2019-12-19 08:22:52.791 7fede2192700 1 mon.m3-1045556@2(peon).osd e39486 e39486: 136 total, 136 up, 136 in 2019-12-19 08:22:53.211 7fede699b700 0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >> conn(0x55c5975a1a80 0x55c57d14c000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.32:33376/0) 2019-12-19 08:22:54.727 7fede2192700 1 mon.m3-1045556@2(peon).osd e39487 e39487: 136 total, 136 up, 136 in 2019-12-19 08:24:12.091 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:24:17.095 7fede4997700 -1 mon.m3-1045556@2(electing) e1 get_health_metrics reporting 1 slow ops, oldest is monmgrreport(6 checks, 0 progress events) 2019-12-19 08:24:18.091 7fede4997700 1 mon.m3-1045556@2(electing).elector(205) init, last seen epoch 205, mid-election, bumping 2019-12-19 08:24:30.651 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:24:30.655 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:24:30.759 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:24:30.763 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:24:30.819 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:24:30.843 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:24:39.123 7fede4997700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:24:39.123 7fede4997700 -1 mon.m3-1045556@2(electing) e1 get_health_metrics reporting 29 slow ops, oldest is monmgrreport(6 checks, 0 progress events) 2019-12-19 08:24:39.139 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:24:39.167 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:24:55.479 7fede619a700 0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >> conn(0x55c59737a880 0x55c58a56b800 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.36:39560/0) 2019-12-19 08:25:00.375 7fede699b700 0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >> conn(0x55c56b684480 0x55c57dedd800 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.77:35404/0) 2019-12-19 08:25:04.892 7fede4997700 1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939174) lease_timeout -- calling new election 2019-12-19 08:25:04.892 7fede2192700 0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election 2019-12-19 08:25:04.892 7fede2192700 1 mon.m3-1045556@2(electing).elector(210) init, last seen epoch 210 2019-12-19 08:25:04.896 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:25:04.912 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:25:04.916 7fede699b700 1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id 2019-12-19 08:25:04.932 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:25:04.932 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:25:52.820 7fede4997700 1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939187) lease_timeout -- calling new election 2019-12-19 08:25:52.848 7fede2192700 0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election 2019-12-19 08:25:52.848 7fede2192700 1 mon.m3-1045556@2(electing).elector(216) init, last seen epoch 216 2019-12-19 08:25:52.848 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:25:52.888 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:25:52.892 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:26:16.976 7fede4997700 1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939217) lease_timeout -- calling new election 2019-12-19 08:26:16.976 7fede2192700 0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election 2019-12-19 08:26:16.976 7fede2192700 1 mon.m3-1045556@2(electing).elector(222) init, last seen epoch 222 2019-12-19 08:26:16.976 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:26:16.996 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:26:35.928 7fede4997700 1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939230) lease_timeout -- calling new election 2019-12-19 08:26:35.928 7fede2192700 0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election 2019-12-19 08:26:35.928 7fede2192700 1 mon.m3-1045556@2(electing).elector(228) init, last seen epoch 228 2019-12-19 08:26:35.976 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:26:44.212 7fede1190700 0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >> conn(0x55c56056b600 0x55c589833000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.41:40376/0) 2019-12-19 08:26:55.164 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:27:00.736 7fede619a700 0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >> conn(0x55c57e293a80 0x55c58bb3d000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.32:39392/0) 2019-12-19 08:27:01.164 7fede4997700 1 mon.m3-1045556@2(electing).elector(237) init, last seen epoch 237, mid-election, bumping 2019-12-19 08:27:07.092 7fede4997700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:27:07.108 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model 2019-12-19 08:28:01.213 7fede699b700 0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >> conn(0x55c593335a80 0x55c589833000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.77:40260/0) 2019-12-19 08:28:31.785 7feddf98d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2019-12-19 08:28:31.785 7feddf98d700 4 rocksdb: [db/db_impl.cc:778] ** DB Stats ** Uptime(secs): 4837204.8 total, 600.0 interval Cumulative writes: 6323K writes, 32M keys, 6323K commit groups, 1.0 writes per commit group, ingest: 216.44 GB, 0.05 MB/s Cumulative WAL: 6323K writes, 6323K syncs, 1.00 writes per sync, written: 216.44 GB, 0.05 MB/s Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent Interval writes: 597 writes, 3177 keys, 597 commit groups, 1.0 writes per commit group, ingest: 28.92 MB, 0.05 MB/s Interval WAL: 597 writes, 597 syncs, 1.00 writes per sync, written: 0.03 MB, 0.05 MB/s Interval stall: 00:00:0.000 H:M:S, 0.0 percent ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 123.3 123.3 0.0 1.0 0.0 238.7 528.94 276.72 25008 0.021 0 0 L6 4/0 227.33 MB 0.0 3855.1 123.3 3731.8 3731.9 0.1 0.0 30.3 340.9 330.0 11580.84 4294.30 24998 0.463 172M 13M Sum 4/0 227.33 MB 0.0 3855.1 123.3 3731.8 3855.2 123.4 0.0 31.3 326.0 326.0 12109.78 4571.03 50006 0.242 172M 13M Int 0/0 0.00 KB 0.0 0.5 0.0 0.4 0.5 0.0 0.0 47.1 329.0 329.9 1.43 0.54 4 0.357 39K 849 ** Compaction Stats [default] ** Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Low 0/0 0.00 KB 0.0 3855.1 123.3 3731.8 3731.9 0.1 0.0 0.0 340.9 330.0 11580.84 4294.30 24998 0.463 172M 13M High 0/0 0.00 KB 0.0 0.0 0.0 0.0 123.3 123.3 0.0 0.0 0.0 238.7 528.92 276.72 25007 0.021 0 0 User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 324.3 0.01 0.00 1 0.012 0 0 Uptime(secs): 4837204.8 total, 600.0 interval Flush(GB): cumulative 123.309, interval 0.010 AddFile(GB): cumulative 0.000, interval 0.000 AddFile(Total Files): cumulative 0, interval 0 AddFile(L0 Files): cumulative 0, interval 0 AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: 3855.25 GB write, 0.82 MB/s write, 3855.11 GB read, 0.82 MB/s read, 12109.8 seconds Interval compaction: 0.46 GB write, 0.79 MB/s write, 0.46 GB read, 0.78 MB/s read, 1.4 seconds Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count ** File Read Latency Histogram By Level [default] ** ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 123.3 123.3 0.0 1.0 0.0 238.7 528.94 276.72 25008 0.021 0 0 L6 4/0 227.33 MB 0.0 3855.1 123.3 3731.8 3731.9 0.1 0.0 30.3 340.9 330.0 11580.84 4294.30 24998 0.463 172M 13M Sum 4/0 227.33 MB 0.0 3855.1 123.3 3731.8 3855.2 123.4 0.0 31.3 326.0 326.0 12109.78 4571.03 50006 0.242 172M 13M Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
Very concerning given the hardware load is only about 0.5 and the disks are SSDs.
History
#1 Updated by Anonymous over 4 years ago
Or due to limited bandwidth? 10G NICs dedicated.
#2 Updated by Anonymous about 4 years ago
couldnt reproduce, close